Clustering Unstructured Data (Flat Files) - An Implementation in Text Mining Tool

Authors

Yasir Safeer

Atika Mustafa

Anis Noor Ali

Abstract

With advances in technology and falling storage costs, individuals and organizations increasingly use electronic media to store textual information and documents. Retrieving relevant information from an unstructured document collection is time-consuming for readers. Finding documents in a large collection is easier and faster when the collection is ordered or classified by group or category, but finding the best such grouping remains an open problem. This paper discusses our implementation of the k-Means clustering algorithm for unstructured text documents, from the representation of unstructured text through to the resulting set of clusters. Based on an analysis of the resulting clusters for a sample set of documents, we also propose a document representation technique that can further improve the clustering result.

Keywords: Information Extraction (IE); Clustering; k-Means Algorithm; Document Classification; Bag-of-words; Document Matching; Document Ranking; Text Mining
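The pipeline named in the abstract (bag-of-words representation followed by k-Means) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the tokenizer, term weighting, distance measure, or centroid initialization, so the regex tokenizer, raw term counts, Euclidean distance, and seeding from the first k documents used here are all assumptions, and the function names are hypothetical.

```python
from collections import Counter
import math
import re


def bag_of_words(text):
    """Lowercase the text, keep alphabetic tokens, and count term frequencies."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def to_vectors(docs):
    """Map each document to a term-count vector over a shared, sorted vocabulary."""
    bags = [bag_of_words(d) for d in docs]
    vocab = sorted(set().union(*bags))
    return [[bag[term] for term in vocab] for bag in bags]


def distance(a, b):
    """Euclidean distance between two equal-length vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(vectors, k, max_iter=100):
    """Plain k-Means. Assumption: the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each document joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: distance(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


docs = [
    "cat dog cat",
    "stock market stock",
    "dog cat dog",
    "market stock market",
]
labels = k_means(to_vectors(docs), k=2)
```

In this toy collection the two animal documents and the two finance documents share no terms, so they fall into separate clusters. Real collections would typically need stop-word removal and some form of term weighting before the distances become meaningful.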

Similar resources

Enhancing Computer Inspection Using Document Clustering for Analysis

In document analysis, computers holding huge numbers of data files are difficult to analyze, and most of the data in those files is unstructured, which makes analysis harder still. We therefore present an approach that reduces the analysis effort by clustering the documents. Clustering is the division of data into groups of similar objects. The clustering techniques used in o...

Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from a large amount of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, the unsupervised classification of similar documents into different groups, is one of the important techniques of text mining. The most important step...

In-depth Interactive Visual Exploration for Bridging Unstructured and Structured Document Content

Semi-structured data refers to the combination of unstructured and structured data. Unstructured data is free text in natural language, while structured data is typically stored in tables and follows a data schema. Recent statistics show that 80% of the data generated in the last two years is unstructured. However, one interesting observation is that free text usually comes along with some s...

Akshaya: A Framework for Mining General Knowledge Semantics From Unstructured Text

We report a tool called Akshaya, which implements a framework to mine four types of “general knowledge semantics” (analytical semantics) from unstructured text. The semantics being mined are semantic siblings, topical anchors, topic expansion and topical markers. The framework provides options to embed more such general knowledge semantic mining algorithms into it. We use a term co-occurrence g...

Data Mining from Document-append Nosql

Due to the unstructured nature of modern digital data, NoSQL storages have been adopted by some enterprises as the preferred storage facility. NoSQL storages can store schema-oriented, semi-structured, schema-less data. A type of NoSQL storage is the document-append storage (e.g., CouchDB and Mongo) which has received high adoption due to its flexibility to store JSON-based data and files as at...

Journal:

CoRR

Volume: abs/1007.4324

Publication date: 2010
